Add amdgpu intrinsics #1976
Conversation
I think it can be "tested" the same way we do …
Force-pushed from d6043df to e102811.
Thanks, I tried to replicate what’s there for nvptx, plus adding …. The diff to my first push from when opening this PR is here: https://github.com/rust-lang/stdarch/compare/b3f5bdae0efbdd5f7297d0225623bd31c7fe895b..e1028110e77561574bfb7ea349154d46b5ea7b86
sayantn
left a comment
Thanks, I have left some comments.

Also, I have noticed that Clang provides a different set of intrinsics than these (…).
Thanks for the review! Interesting, I thought …

I do plan to write a common `gpu` module, roughly:

```rust
mod amdgpu {
    mod gpu {
        use super::*;
        pub fn block_id_x() -> u32 {
            workitem_id_x()
        }
    }
}
// Same for nvptx

mod gpu {
    #[cfg(target_arch = "amdgpu")]
    pub use amdgpu::gpu::*;
    #[cfg(target_arch = "nvptx64")]
    pub use nvptx::gpu::*;
    // + more intrinsics as in gpuintrin.h
}
```
The analogue to …

If there are some interesting platform-specific intrinsics, we can add them, but generally we follow GCC and Clang in regards to intrinsics. Yeah, a common …
Add intrinsics for the amdgpu architecture.
Force-pushed from e102811 to d850408.
Thanks for the review, that’s a nice case for const generics! I fixed all your inline comments. Also thanks for linking the clang intrinsics, I tend to forget about those.

As common GPU intrinsics are a larger topic, I started a Discourse post here (please add your thoughts!): https://internals.rust-lang.org/t/naming-gpu-things-in-the-rust-compiler-and-standard-library/23833

My planned intrinsic “stack” is roughly as follows, from high-level to low-level: …
We could also use clang names for …

I would prefer following LLVM’s intrinsic names in ….

Note for my future self: I got run.sh to work with …
sayantn
left a comment
Thanks, left some more comments
```rust
//! [LLVM implementation]: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
#[allow(improper_ctypes)]
unsafe extern "unadjusted" {
```
Is there some specific reason you only made some of the intrinsics (both the LLVM declarations and the Rust wrappers) safe? Normally we make an intrinsic unsafe only if it can cause memory unsafety, and I don't see how something like `s.barrier.signal` is unsafe while `s.barrier` is safe.
Yes, I tried to follow the Rust safe/unsafe principles.

Every intrinsic that is marked safe will not cause undefined behavior or memory safety issues, no matter which arguments it is called with. They could still cause other issues, like deadlocks if `s.barrier` is used incorrectly, but this is still “safe” in the Rust sense.

Intrinsics that are unsafe can cause undefined behavior, either in the compiler/LLVM or in the hardware, when called with incorrect arguments. E.g. `readlane` or `permlane` can read registers from inactive threads (which I guess is either directly UB in LLVM or poison, which can cause UB further down the line).

For `s.barrier.signal`, I’m not sure what the allowed `BARRIER_TYPE`s are. What happens when passing an invalid number for the type seems to be currently unspecified (the documentation says “only for non-named barriers”) and could very well become undefined behavior with a future LLVM change.
```rust
    fn llvm_s_barrier_leave(barrier_type: u16);
    #[link_name = "llvm.amdgcn.s.get.barrier.state"]
    fn llvm_s_get_barrier_state(barrier_type: i32) -> u32;
    #[link_name = "llvm.amdgcn.s.wave.barrier"]
```
```diff
-#[link_name = "llvm.amdgcn.s.wave.barrier"]
+#[link_name = "llvm.amdgcn.wave.barrier"]
```

Typo (probably also rename the intrinsics to `(llvm_)wave_barrier`).
```rust
    safe fn llvm_wave_reduce_add(value: u32, strategy: u32) -> u32;
    #[link_name = "llvm.amdgcn.wave.reduce.fadd"]
    safe fn llvm_wave_reduce_fadd(value: f32, strategy: u32) -> f32;
    #[link_name = "llvm.amdgcn.wave.reduce.and"]
```
Q: I see that LLVM also defines `fsub` and `sub` versions of these intrinsics; why aren't they included here?
I couldn’t make much sense of the sub variants 😄
The reduce intrinsics do some accumulation/reduction over the 32 or 64 threads in a wave.
This can be neatly implemented as a parallel reduce (i.e. a binary tree where first thread 0+1→0, 2+3→2, … are added, then 0+2, 4+6, etc.; LLVM currently doesn’t do this… but that’s a different problem).
Doing an add reduce of multiple values (over a wave here) makes sense to me. add is an associative operation.
Doing a sub doesn’t make sense to me, because it’s not associative. The order of operations matters, so what is a parallel sub reduce supposed to do? (Also, is the very first value kept positive or subtracted from 0?)
I looked into the LLVM implementation; it seems to do a sum (add reduce) and then multiply by -1. If someone needed that functionality, I think it would be much clearer to write exactly that: `reduce.add * -1`. It would also be as efficient as the `reduce.sub` intrinsic, since the implementation doesn’t do anything different.
```rust
    safe fn llvm_s_get_waveid_in_workgroup() -> u32;

    // gfx11
    #[link_name = "llvm.amdgcn.permlane64"]
```
```diff
-#[link_name = "llvm.amdgcn.permlane64"]
+#[link_name = "llvm.amdgcn.permlane64.i32"]
```
the type parameter should be specified
```rust
#[inline]
#[unstable(feature = "stdarch_amdgpu", issue = "149988")]
pub unsafe fn sched_barrier<const MASK: u32>() {
    static_assert_uimm_bits!(MASK, 11);
```
Is this check enough? It seems like MASK can only take the values mentioned above, not any linear combination of them; in that case we should also check `MASK.is_power_of_two()` or something like that. Also, nit: it might be better to make some constants for these values, but I'll leave that to your discretion.
I see it used like a proper bitmask here (15/0xf), so I think it’s fine, but thanks for checking!
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.barrier.ll#L16
For some reason when I tried to build this with all #[inline]s replaced by #[inline(never)] (so that the functions actually codegen), I get a segfault for permlane64_u32, and LLVM selection errors for s.get.waveid.in.workgroup and wave.id. This is probably due to missing target features or something similar, I noticed that you made comments that these are only available in gfx12 and higher (probably, my knowledge of GPUs is nonexistent), but we are building with target-cpu=gfx900.
Ah yes, I can reproduce this.
There are quite a few “target-cpu”s (gfx numbers). Every generation of GPUs adds four or more of them.
E.g. for the RX 90xx series, there’s gfx1200 (RX 9060 (XT)), gfx1201 (RX 9070 (XT)), gfx1250 and gfx1251 (for APUs).
The full list is available here: https://llvm.org/docs/AMDGPUUsage.html#processors
The HPC/datacenter cards (MI 100, MI 250, … in case you’ve heard the names) are building upon gfx9, so new releases are gfx90a, gfx942, etc.
The amdgpu LLVM backend has about one target feature per intrinsic and a large list of target features for every generation and variant.
I’m hesitant to introduce such things as target-features to Rust as it quickly becomes a maintenance burden.
I don't have any strong preferences. If you feel that the lower-level LLVM intrinsics are more useful than just having the clang intrinsics, go ahead with that; we can change it, remove it, or add the clang intrinsics too if required (as long as they are unstable).
Add intrinsics for the amdgpu architecture.
I’m not sure how to add/run CI (`ci/run.sh` fails for me, e.g. for nvptx, because `core` cannot be found), but I checked that it compiles without warnings with ….
Tracking issue: rust-lang/rust#149988